Using readxl
The Massachusetts Bay Transportation Authority (“MBTA” or just “the T” for short) manages America’s oldest subway, as well as Greater Boston’s commuter rail, ferry, and bus systems.
It’s your first day on the job as the T’s data analyst and you’ve been tasked with analyzing average ridership through time. You’re in luck, because this chapter of the course will guide you through cleaning a set of MBTA ridership data!
The dataset is stored as an Excel spreadsheet called mbta.xlsx in your working directory. You’ll use the read_excel() function from Hadley Wickham’s readxl package to import it.
The first time you import a dataset, you might not know how many rows need to be skipped. In this case, the first row is a title, so you’ll need to skip the first row.
# Load readxl
library(readxl)
# Import mbta.xlsx and skip first row: mbta
mbta <- read_excel("../xDatasets/mbta.xlsx", skip = 1)Notice that the first row isn’t actually data, but specifying skip = 1 fixes that.
Examining the data
Your new boss at the T has tasked you with analyzing the ridership data. Of course, you’re going to clean the dataset first. The first step when cleaning a dataset is to explore it a bit.
The mbta data frame is already loaded in your workspace. Pay particular attention to how the rows and columns are organized and to the locations of missing values.
# View the structure of mbta
str(mbta, give.attr = FALSE)## Classes 'tbl_df', 'tbl' and 'data.frame': 11 obs. of 60 variables:
## $ ..1 : num 1 2 3 4 5 6 7 8 9 10 ...
## $ mode : chr "All Modes by Qtr" "Boat" "Bus" "Commuter Rail" ...
## $ 2007-01: chr "NA" "4" "335.819" "142.2" ...
## $ 2007-02: chr "NA" "3.6" "338.675" "138.5" ...
## $ 2007-03: num 1188 40 340 138 459 ...
## $ 2007-04: chr "NA" "4.3" "352.162" "139.5" ...
## $ 2007-05: chr "NA" "4.9" "354.367" "139" ...
## $ 2007-06: num 1246 5.8 350.5 143 477 ...
## $ 2007-07: chr "NA" "6.521" "357.519" "142.391" ...
## $ 2007-08: chr "NA" "6.572" "355.479" "142.364" ...
## $ 2007-09: num 1256.57 5.47 372.6 143.05 499.57 ...
## $ 2007-10: chr "NA" "5.145" "368.847" "146.542" ...
## $ 2007-11: chr "NA" "3.763" "330.826" "145.089" ...
## $ 2007-12: num 1216.89 2.98 312.92 141.59 448.27 ...
## $ 2008-01: chr "NA" "3.175" "340.324" "142.145" ...
## $ 2008-02: chr "NA" "3.111" "352.905" "142.607" ...
## $ 2008-03: num 1253.52 3.51 361.15 137.45 494.05 ...
## $ 2008-04: chr "NA" "4.164" "368.189" "140.389" ...
## $ 2008-05: chr "NA" "4.015" "363.903" "142.585" ...
## $ 2008-06: num 1314.82 5.19 362.96 142.06 518.35 ...
## $ 2008-07: chr "NA" "6.016" "370.921" "145.731" ...
## $ 2008-08: chr "NA" "5.8" "361.057" "144.565" ...
## $ 2008-09: num 1307.04 4.59 389.54 141.91 517.32 ...
## $ 2008-10: chr "NA" "4.285" "357.974" "151.957" ...
## $ 2008-11: chr "NA" "3.488" "345.423" "152.952" ...
## $ 2008-12: num 1232.65 3.01 325.77 140.81 446.74 ...
## $ 2009-01: chr "NA" "3.014" "338.532" "141.448" ...
## $ 2009-02: chr "NA" "3.196" "360.412" "143.529" ...
## $ 2009-03: num 1209.79 3.33 353.69 142.89 467.22 ...
## $ 2009-04: chr "NA" "4.049" "359.38" "142.34" ...
## $ 2009-05: chr "NA" "4.119" "354.75" "144.225" ...
## $ 2009-06: num 1233.1 4.9 347.9 142 473.1 ...
## $ 2009-07: chr "NA" "6.444" "339.477" "137.691" ...
## $ 2009-08: chr "NA" "5.903" "332.661" "139.158" ...
## $ 2009-09: num 1230.5 4.7 374.3 139.1 500.4 ...
## $ 2009-10: chr "NA" "4.212" "385.868" "137.104" ...
## $ 2009-11: chr "NA" "3.576" "366.98" "129.343" ...
## $ 2009-12: num 1207.85 3.11 332.39 126.07 440.93 ...
## $ 2010-01: chr "NA" "3.207" "362.226" "130.91" ...
## $ 2010-02: chr "NA" "3.195" "361.138" "131.918" ...
## $ 2010-03: num 1208.86 3.48 373.44 131.25 483.4 ...
## $ 2010-04: chr "NA" "4.452" "378.611" "131.722" ...
## $ 2010-05: chr "NA" "4.415" "380.171" "128.8" ...
## $ 2010-06: num 1244.41 5.41 363.27 129.14 490.26 ...
## $ 2010-07: chr "NA" "6.513" "353.04" "122.935" ...
## $ 2010-08: chr "NA" "6.269" "343.688" "129.732" ...
## $ 2010-09: num 1225.5 4.7 381.6 132.9 521.1 ...
## $ 2010-10: chr "NA" "4.402" "384.987" "131.033" ...
## $ 2010-11: chr "NA" "3.731" "367.955" "130.889" ...
## $ 2010-12: num 1216.26 3.16 326.34 121.42 450.43 ...
## $ 2011-01: chr "NA" "3.14" "334.958" "128.396" ...
## $ 2011-02: chr "NA" "3.284" "346.234" "125.463" ...
## $ 2011-03: num 1223.45 3.67 380.4 134.37 516.73 ...
## $ 2011-04: chr "NA" "4.251" "380.446" "134.169" ...
## $ 2011-05: chr "NA" "4.431" "385.289" "136.14" ...
## $ 2011-06: num 1302.41 5.47 376.32 135.58 529.53 ...
## $ 2011-07: chr "NA" "6.581" "361.585" "132.41" ...
## $ 2011-08: chr "NA" "6.733" "353.793" "130.616" ...
## $ 2011-09: num 1291 5 388 137 550 ...
## $ 2011-10: chr "NA" "4.484" "398.456" "128.72" ...
# View the first 6 rows of mbta
mbta %>%
head() %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", , font_size = 11) %>%
row_spec(0, bold = T, color = "white", background = "#3f7689")| ..1 | mode | 2007-01 | 2007-02 | 2007-03 | 2007-04 | 2007-05 | 2007-06 | 2007-07 | 2007-08 | 2007-09 | 2007-10 | 2007-11 | 2007-12 | 2008-01 | 2008-02 | 2008-03 | 2008-04 | 2008-05 | 2008-06 | 2008-07 | 2008-08 | 2008-09 | 2008-10 | 2008-11 | 2008-12 | 2009-01 | 2009-02 | 2009-03 | 2009-04 | 2009-05 | 2009-06 | 2009-07 | 2009-08 | 2009-09 | 2009-10 | 2009-11 | 2009-12 | 2010-01 | 2010-02 | 2010-03 | 2010-04 | 2010-05 | 2010-06 | 2010-07 | 2010-08 | 2010-09 | 2010-10 | 2010-11 | 2010-12 | 2011-01 | 2011-02 | 2011-03 | 2011-04 | 2011-05 | 2011-06 | 2011-07 | 2011-08 | 2011-09 | 2011-10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | All Modes by Qtr | NA | NA | 1187.653 | NA | NA | 1245.959 | NA | NA | 1256.571 | NA | NA | 1216.890 | NA | NA | 1253.522 | NA | NA | 1314.821 | NA | NA | 1307.041 | NA | NA | 1232.655 | NA | NA | 1209.792 | NA | NA | 1233.085 | NA | NA | 1230.461 | NA | NA | 1207.845 | NA | NA | 1208.857 | NA | NA | 1244.409 | NA | NA | 1225.527 | NA | NA | 1216.262 | NA | NA | 1223.452 | NA | NA | 1302.414 | NA | NA | 1290.549 | NA |
| 2 | Boat | 4 | 3.6 | 40.000 | 4.3 | 4.9 | 5.800 | 6.521 | 6.572 | 5.469 | 5.145 | 3.763 | 2.985 | 3.175 | 3.111 | 3.512 | 4.164 | 4.015 | 5.189 | 6.016 | 5.8 | 4.587 | 4.285 | 3.488 | 3.007 | 3.014 | 3.196 | 3.330 | 4.049 | 4.119 | 4.900 | 6.444 | 5.903 | 4.696 | 4.212 | 3.576 | 3.113 | 3.207 | 3.195 | 3.481 | 4.452 | 4.415 | 5.411 | 6.513 | 6.269 | 4.699 | 4.402 | 3.731 | 3.156 | 3.14 | 3.284 | 3.674 | 4.251 | 4.431 | 5.474 | 6.581 | 6.733 | 5.003 | 4.484 |
| 3 | Bus | 335.819 | 338.675 | 339.867 | 352.162 | 354.367 | 350.543 | 357.519 | 355.479 | 372.598 | 368.847 | 330.826 | 312.920 | 340.324 | 352.905 | 361.155 | 368.189 | 363.903 | 362.962 | 370.921 | 361.057 | 389.537 | 357.974 | 345.423 | 325.767 | 338.532 | 360.412 | 353.686 | 359.38 | 354.75 | 347.865 | 339.477 | 332.661 | 374.260 | 385.868 | 366.98 | 332.394 | 362.226 | 361.138 | 373.443 | 378.611 | 380.171 | 363.275 | 353.04 | 343.688 | 381.622 | 384.987 | 367.955 | 326.338 | 334.958 | 346.234 | 380.399 | 380.446 | 385.289 | 376.317 | 361.585 | 353.793 | 388.271 | 398.456 |
| 4 | Commuter Rail | 142.2 | 138.5 | 137.700 | 139.5 | 139 | 143.000 | 142.391 | 142.364 | 143.051 | 146.542 | 145.089 | 141.585 | 142.145 | 142.607 | 137.453 | 140.389 | 142.585 | 142.057 | 145.731 | 144.565 | 141.907 | 151.957 | 152.952 | 140.810 | 141.448 | 143.529 | 142.893 | 142.34 | 144.225 | 142.006 | 137.691 | 139.158 | 139.087 | 137.104 | 129.343 | 126.066 | 130.91 | 131.918 | 131.252 | 131.722 | 128.8 | 129.144 | 122.935 | 129.732 | 132.892 | 131.033 | 130.889 | 121.422 | 128.396 | 125.463 | 134.374 | 134.169 | 136.14 | 135.581 | 132.41 | 130.616 | 136.901 | 128.72 |
| 5 | Heavy Rail | 435.294 | 448.271 | 458.583 | 472.201 | 474.579 | 477.032 | 471.735 | 461.605 | 499.566 | 457.741 | 488.348 | 448.268 | 472.624 | 492.1 | 494.046 | 513.204 | 507.952 | 518.349 | 512.309 | 476.99 | 517.324 | 523.644 | 487.115 | 446.743 | 461.004 | 482.407 | 467.224 | 493.152 | 475.634 | 473.099 | 470.828 | 466.676 | 500.403 | 513.406 | 480.278 | 440.925 | 464.069 | 480.121 | 483.397 | 502.374 | 487.4 | 490.263 | 488.587 | 473.731 | 521.099 | 532.403 | 502.887 | 450.433 | 468.418 | 504.068 | 516.730 | 528.631 | 528.122 | 529.528 | 532.888 | 508.145 | 550.137 | 554.932 |
| 6 | Light Rail | 227.231 | 240.262 | 241.444 | 255.557 | 248.262 | 246.108 | 243.286 | 234.907 | 265.748 | 241.434 | 250.497 | 233.379 | 241.223 | 249.306 | 253.132 | 271.07 | 258.351 | 266.961 | 270.158 | 239.344 | 258.171 | 250.063 | 232.068 | 205.420 | 215.66 | 228.737 | 222.844 | 238.232 | 224.962 | 226.259 | 230.308 | 231.783 | 250.922 | 230.739 | 214.711 | 194.446 | 204.396 | 213.136 | 211.693 | 227.246 | 217.805 | 215.922 | 218.729 | 210.53 | 236.368 | 236.366 | 221.881 | 196.211 | 198.45 | 219.886 | 227.935 | 242.28 | 225.776 | 221.865 | 231.01 | 220.164 | 244.949 | 237.768 |
# View a summary of mbta
sum_mbta <- as.data.frame(do.call(cbind, lapply(mbta, summary)))
sum_mbta[,-1] %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", , font_size = 11) %>%
row_spec(0, bold = T, color = "white", background = "#3f7689")| mode | 2007-01 | 2007-02 | 2007-03 | 2007-04 | 2007-05 | 2007-06 | 2007-07 | 2007-08 | 2007-09 | 2007-10 | 2007-11 | 2007-12 | 2008-01 | 2008-02 | 2008-03 | 2008-04 | 2008-05 | 2008-06 | 2008-07 | 2008-08 | 2008-09 | 2008-10 | 2008-11 | 2008-12 | 2009-01 | 2009-02 | 2009-03 | 2009-04 | 2009-05 | 2009-06 | 2009-07 | 2009-08 | 2009-09 | 2009-10 | 2009-11 | 2009-12 | 2010-01 | 2010-02 | 2010-03 | 2010-04 | 2010-05 | 2010-06 | 2010-07 | 2010-08 | 2010-09 | 2010-10 | 2010-11 | 2010-12 | 2011-01 | 2011-02 | 2011-03 | 2011-04 | 2011-05 | 2011-06 | 2011-07 | 2011-08 | 2011-09 | 2011-10 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. | 11 | 11 | 11 | 0.114 | 11 | 11 | 0.096 | 11 | 11 | -0.007 | 11 | 11 | -0.06 | 11 | 11 | 0.058 | 11 | 11 | 0.06 | 11 | 11 | 0.021 | 11 | 11 | -0.015 | 11 | 11 | -0.05 | 11 | 11 | -0.079 | 11 | 11 | -0.035 | 11 | 11 | -0.022 | 11 | 11 | 0.012 | 11 | 11 | 0.008 | 11 | 11 | 0.001 | 11 | 11 | -0.004 | 11 | 11 | 0.05 | 11 | 11 | 0.054 | 11 | 11 | 0.043 | 11 |
| 1st Qu. | character | character | character | 9.2785 | character | character | 5.7 | character | character | 5.539 | character | character | 4.385 | character | character | 5.1695 | character | character | 5.7415 | character | character | 5.6905 | character | character | 4.6885 | character | character | 5.0025 | character | character | 5.845 | character | character | 5.693 | character | character | 4.7835 | character | character | 5.2735 | character | character | 6.4355 | character | character | 5.5675 | character | character | 4.4665 | character | character | 6.0305 | character | character | 6.9255 | character | character | 6.6605 | character |
| Median | character | character | character | 137.7 | character | character | 143 | character | character | 143.051 | character | character | 141.585 | character | character | 137.453 | character | character | 142.057 | character | character | 141.907 | character | character | 140.81 | character | character | 142.893 | character | character | 142.006 | character | character | 139.087 | character | character | 126.066 | character | character | 131.252 | character | character | 129.144 | character | character | 132.892 | character | character | 121.422 | character | character | 134.374 | character | character | 135.581 | character | character | 136.901 | character |
| Mean | 11 | 11 | 11 | 330.292454545455 | 11 | 11 | 339.846545454545 | 11 | 11 | 352.553727272727 | 11 | 11 | 321.587636363636 | 11 | 11 | 345.603818181818 | 11 | 11 | 359.666909090909 | 11 | 11 | 362.099363636364 | 11 | 11 | 319.882545454545 | 11 | 11 | 330.142363636364 | 11 | 11 | 333.194090909091 | 11 | 11 | 346.686727272727 | 11 | 11 | 312.962545454545 | 11 | 11 | 332.725454545455 | 11 | 11 | 335.964 | 11 | 11 | 346.523818181818 | 11 | 11 | 312.917090909091 | 11 | 11 | 345.165545454545 | 11 | 11 | 353.330727272727 | 11 | 11 | 362.554545454545 | 11 |
| 3rd Qu. | character | character | character | 399.225 | character | character | 413.7875 | character | character | 436.082 | character | character | 380.594 | character | character | 427.6005 | character | character | 440.6555 | character | character | 453.4305 | character | character | 386.255 | character | character | 410.455 | character | character | 410.482 | character | character | 437.3315 | character | character | 386.6595 | character | character | 428.42 | character | character | 426.769 | character | character | 451.3605 | character | character | 388.3855 | character | character | 448.5645 | character | character | 452.9225 | character | character | 469.204 | character |
| Max. | character | character | character | 1204.725 | character | character | 1246.129 | character | character | 1310.764 | character | character | 1216.89 | character | character | 1274.031 | character | character | 1320.728 | character | character | 1338.015 | character | character | 1232.655 | character | character | 1210.912 | character | character | 1233.085 | character | character | 1291.564 | character | character | 1207.845 | character | character | 1225.556 | character | character | 1244.409 | character | character | 1293.117 | character | character | 1216.262 | character | character | 1286.66 | character | character | 1302.414 | character | character | 1348.754 | character |
Do you notice anything strange about how the rows and columns are organized?
Removing unnecessary rows and columns
It appears that the data are organized with observations stored as columns rather than as rows. You can fix that.
First, though, you can address the missing data. All of the NA values are stored in the All Modes by Qtr row. This row really belongs in a different data frame; it is a quarterly average of weekday MBTA ridership. Since this dataset tracks monthly average ridership, you’ll remove that row.
Similarly, the 7th row (Pct Chg / Yr) and the 11th row (TOTAL) are not really observations as much as they are analysis. Go ahead and remove the 7th and 11th rows as well.
The first column also needs to be removed because it’s just listing the row numbers.
In case you were wondering, this dataset is stored as a tibble which is just a specific type of data frame.
# Remove rows 1, 7, and 11 of mbta: mbta2
mbta2 <- mbta[-c(1,7,11),]
# Remove the first column of mbta2: mbta3
mbta3 <- mbta2[,-1]Observations are stored in columns
Recall from a few exercises back that in your T ridership data, variables are stored in rows instead of columns. If you forget what that looked like, go ahead and enter head(mbta3) in the console and/or look at the screenshot.
The different modes of transportation (commuter rail, bus, subway, ferry, …) are variables, providing information about each month’s average ridership. The months themselves are observations. You can tell which is which because as you go through time, the month changes, but the modes of transport offered by the T do not.
As is customary, you want to represent variables in columns rather than rows. The first step is to use the gather() function from the tidyr package, which will gather columns into key-value pairs.
# Load tidyr
library(tidyr)
# Gather columns of mbta3: mbta4
mbta4 <- gather(mbta3, month, thou_riders, -mode)
# View the head of mbta4
mbta4 %>%
head(10) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", , font_size = 11) %>%
row_spec(0, bold = T, color = "white", background = "#3f7689")| mode | month | thou_riders |
|---|---|---|
| Boat | 2007-01 | 4 |
| Bus | 2007-01 | 335.819 |
| Commuter Rail | 2007-01 | 142.2 |
| Heavy Rail | 2007-01 | 435.294 |
| Light Rail | 2007-01 | 227.231 |
| Private Bus | 2007-01 | 4.772 |
| RIDE | 2007-01 | 4.9 |
| Trackless Trolley | 2007-01 | 12.757 |
| Boat | 2007-02 | 3.6 |
| Bus | 2007-02 | 338.675 |
Before moving on, have a look at the console. Your dataset is long now – notice that the first column still stores variable names, but now, months have their own column instead of being headers. Also notice the data type of each column.
Type conversions
In a minute, you’ll put variables where they belong (as column names). But first, take this opportunity to change the average weekday ridership column, thou_riders, into numeric values rather than character strings. That way, you’ll be able to do things like compare values and do math.
# Coerce thou_riders to numeric
mbta4$thou_riders <- as.numeric(mbta4$thou_riders)Variables are stored in both rows and columns
Now, you can finish the job you started earlier: getting variables into columns. Right now, variables are stored as “keys” in the mode column. You’ll use the tidyr function spread() to make them into columns containing average weekday ridership for the given month and mode of transport.
# Spread the contents of mbta4: mbta5
mbta5 <- spread(mbta4, mode, thou_riders)
# View the head of mbta5
mbta5 %>%
head(10) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", , font_size = 11) %>%
row_spec(0, bold = T, color = "white", background = "#3f7689")| month | Boat | Bus | Commuter Rail | Heavy Rail | Light Rail | Private Bus | RIDE | Trackless Trolley |
|---|---|---|---|---|---|---|---|---|
| 2007-01 | 4.000 | 335.819 | 142.200 | 435.294 | 227.231 | 4.772 | 4.900 | 12.757 |
| 2007-02 | 3.600 | 338.675 | 138.500 | 448.271 | 240.262 | 4.417 | 5.000 | 12.913 |
| 2007-03 | 40.000 | 339.867 | 137.700 | 458.583 | 241.444 | 4.574 | 5.500 | 13.057 |
| 2007-04 | 4.300 | 352.162 | 139.500 | 472.201 | 255.557 | 4.542 | 5.400 | 13.444 |
| 2007-05 | 4.900 | 354.367 | 139.000 | 474.579 | 248.262 | 4.768 | 5.400 | 13.479 |
| 2007-06 | 5.800 | 350.543 | 143.000 | 477.032 | 246.108 | 4.722 | 5.600 | 13.323 |
| 2007-07 | 6.521 | 357.519 | 142.391 | 471.735 | 243.286 | 3.936 | 5.253 | 13.311 |
| 2007-08 | 6.572 | 355.479 | 142.364 | 461.605 | 234.907 | 3.946 | 5.308 | 13.142 |
| 2007-09 | 5.469 | 372.598 | 143.051 | 499.566 | 265.748 | 4.329 | 5.609 | 14.393 |
| 2007-10 | 5.145 | 368.847 | 146.542 | 457.741 | 241.434 | 4.315 | 5.806 | 14.622 |
Notice how mbta5 now has columns for modes of transportation with values of average weekday ridership for the given months.
Separating columns
Your dataset is already looking much better! Your boss saw what a great job you’re doing and now wants you to do an analysis of the T’s ridership during certain months across all years.
Your dataset has month names in it, so that analysis will be a piece of cake. There’s only one small problem: if you want to look at ridership on the T during every January (for example), the month and year are together in the same column, which makes it a little tricky.
In this exercise, you’ll separate the month column into distinct month and year columns to make life easier.
# Split month column into month and year: mbta6
mbta6 <- separate(mbta5, month, c("year", "month"))
# View the head of mbta6
mbta6 %>%
head(10) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", , font_size = 11) %>%
row_spec(0, bold = T, color = "white", background = "#3f7689")| year | month | Boat | Bus | Commuter Rail | Heavy Rail | Light Rail | Private Bus | RIDE | Trackless Trolley |
|---|---|---|---|---|---|---|---|---|---|
| 2007 | 01 | 4.000 | 335.819 | 142.200 | 435.294 | 227.231 | 4.772 | 4.900 | 12.757 |
| 2007 | 02 | 3.600 | 338.675 | 138.500 | 448.271 | 240.262 | 4.417 | 5.000 | 12.913 |
| 2007 | 03 | 40.000 | 339.867 | 137.700 | 458.583 | 241.444 | 4.574 | 5.500 | 13.057 |
| 2007 | 04 | 4.300 | 352.162 | 139.500 | 472.201 | 255.557 | 4.542 | 5.400 | 13.444 |
| 2007 | 05 | 4.900 | 354.367 | 139.000 | 474.579 | 248.262 | 4.768 | 5.400 | 13.479 |
| 2007 | 06 | 5.800 | 350.543 | 143.000 | 477.032 | 246.108 | 4.722 | 5.600 | 13.323 |
| 2007 | 07 | 6.521 | 357.519 | 142.391 | 471.735 | 243.286 | 3.936 | 5.253 | 13.311 |
| 2007 | 08 | 6.572 | 355.479 | 142.364 | 461.605 | 234.907 | 3.946 | 5.308 | 13.142 |
| 2007 | 09 | 5.469 | 372.598 | 143.051 | 499.566 | 265.748 | 4.329 | 5.609 | 14.393 |
| 2007 | 10 | 5.145 | 368.847 | 146.542 | 457.741 | 241.434 | 4.315 | 5.806 | 14.622 |
Do your values seem reasonable?
Before you write up the analysis for your boss, it’s a good idea to screen the data for any obvious mistakes and/or outliers.
There are many valid techniques for doing this; you’ll practice a couple of them here.
# View a summary of mbta6
sum_mbta6 <- as.data.frame(do.call(cbind, lapply(mbta6, summary)))
sum_mbta6[,-2] %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", , font_size = 11) %>%
row_spec(0, bold = T, color = "white", background = "#3f7689")| year | Boat | Bus | Commuter Rail | Heavy Rail | Light Rail | Private Bus | RIDE | Trackless Trolley | |
|---|---|---|---|---|---|---|---|---|---|
| Min. | 58 | 2.985 | 312.92 | 121.422 | 435.294 | 194.446 | 2.213 | 4.9 | 5.777 |
| 1st Qu. | character | 3.494 | 345.62575 | 131.3695 | 471.05475 | 220.58925 | 2.64075 | 5.9655 | 11.67925 |
| Median | character | 4.2925 | 359.896 | 138.75 | 487.2575 | 231.9255 | 2.8195 | 6.615 | 12.5975 |
| Mean | 58 | 5.06796551724138 | 358.590448275862 | 137.351534482759 | 489.293482758621 | 232.988810344828 | 3.35218965517241 | 6.6036724137931 | 12.1250517241379 |
| 3rd Qu. | character | 5.3555 | 372.17875 | 142.358 | 511.268 | 244.53325 | 4.16725 | 7.1495 | 13.32 |
| Max. | character | 40 | 398.456 | 152.952 | 554.932 | 271.07 | 4.878 | 8.598 | 15.109 |
# Generate a histogram of Boat column
hist(mbta6$Boat)That’s quite an interesting histogram – every value clustered around 4 and one loner out around 40.
Dealing with entry error
Think for a minute about that Boat histogram. Every month, average weekday commuter boat ridership was on either side of four thousand. Then, one month it jumped to 40 thousand without warning?
Unless the Olympics were happening in Boston that month (they weren’t), this value is certainly an error. You can assume that whoever was entering the data that month accidentally typed 40 instead of 4.
Because it’s an error, you don’t want this value influencing your analysis. In this exercise, you’ll locate the incorrect value and change it to 4.
After you make the change, you’ll run the last two commands in the editor as-is. They use functions you may not know yet to produce some cool ridership plots: one showing the lesser-used modes of transport (take a look at the gorgeous seasonal variation in Boat ridership), and one showing all modes of transport. The plots are based on the long version of the data we produced in Exercise 4 – a good example of using different data formats for different purposes.
# Find the row number of the incorrect value: i
i <- which(mbta6$Boat > 20)
# Replace the incorrect value with 4
mbta6$Boat[i] <- 4
# Generate a histogram of Boat column
hist(mbta6$Boat)# Look at Boat and Trackless Trolley ridership over time (don't change)
j <- which(mbta4$mode == "Boat" | mbta4$mode == "Trackless Trolley")
mbta_boat <- mbta4[j, c("month", "thou_riders", "mode")]
ggplot(mbta_boat, aes(x = month, y = thou_riders, col = mode)) + geom_point() +
scale_x_discrete(name = "Month", breaks = c(200701, 200801, 200901, 201001, 201101)) +
scale_y_continuous(name = "Avg Weekday Ridership (thousands)")# Look at all T ridership over time (don't change)
mbta_all <- mbta4[, c("month", "thou_riders", "mode")]
ggplot(mbta_all, aes(x = month, y = thou_riders, col = mode)) + geom_point() +
scale_x_discrete(name = "Month", breaks = c(200701, 200801, 200901, 201001, 201101)) +
scale_y_continuous(name = "Avg Weekday Ridership (thousands)")